harp Update - Dec 2024

Andrew Singleton

harpCore

New functions

join_multi_groups() joins grouping data to a data frame when rows can belong to more than one group

ff <- join_multi_groups(det_point_df, station_groups)
ff
# A tibble: 2 × 2
    SID station_group                                                
  <dbl> <chr>                                                        
1  1001 <All><METCOOP><EWGLAM><Norway 2><NEU coast>                  
2  1002 <All><Svalbard Barents><METCOOP><Norway><Norway 2><NEU coast>


attr(ff, "multi_groups")
$station_group
[1] "<All>"              "<Svalbard Barents>" "<METCOOP>"         
[4] "<EWGLAM>"           "<Norway>"           "<Norway 2>"        
[7] "<NEU coast>"       
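The compound labels keep one row per station while still recording every group it belongs to. As a minimal base R sketch of how membership could be recovered from those labels, the data frame below is rebuilt by hand from the printed output above rather than produced by harp, and in_group() is a hypothetical helper, not a harpCore function.

```r
# Rebuilt by hand from the printed output; in practice this would come from
# join_multi_groups().
ff <- data.frame(
  SID = c(1001, 1002),
  station_group = c(
    "<All><METCOOP><EWGLAM><Norway 2><NEU coast>",
    "<All><Svalbard Barents><METCOOP><Norway><Norway 2><NEU coast>"
  ),
  stringsAsFactors = FALSE
)
groups <- c(
  "<All>", "<Svalbard Barents>", "<METCOOP>", "<EWGLAM>",
  "<Norway>", "<Norway 2>", "<NEU coast>"
)

# Hypothetical helper (not part of harpCore): TRUE where a row belongs to the
# group. Matching the whole "<...>" token keeps "<Norway>" and "<Norway 2>"
# distinct.
in_group <- function(labels, group) grepl(group, labels, fixed = TRUE)

# Number of stations in each group, without duplicating any rows.
members <- vapply(
  groups,
  function(g) sum(in_group(ff$station_group, g)),
  integer(1)
)
print(members)
```

The angle brackets act as delimiters, so a plain fixed-string match is enough to separate group names that share a prefix.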

harpIO

Testing Apache Arrow with parquet

  • Alternative to SQLite
  • Files are much smaller
  • Reading is much faster [not yet tested on distributed file systems]
  • Writing appears to be faster, but not fully tested
  • Joins between different datasets can be done before collecting, so finding common cases and joining observations to forecasts is much more memory efficient
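Joining before collecting can be sketched with the arrow and dplyr packages. This is an illustration under stated assumptions, not harpIO code: the tables, column names, values and directory names are all made up.

```r
library(arrow)
library(dplyr)

# Hypothetical forecast and observation tables with made-up values.
fcst <- data.frame(
  SID = c(1001L, 1002L, 1003L),
  valid_dttm = as.POSIXct("2024-12-01 00:00:00", tz = "UTC"),
  fcst = c(1.5, 2.5, 3.5)
)
obs <- data.frame(
  SID = c(1001L, 1002L),
  valid_dttm = as.POSIXct("2024-12-01 00:00:00", tz = "UTC"),
  obs = c(1.0, 3.0)
)

# Write each table as a parquet dataset in a temporary directory.
fcst_dir <- file.path(tempdir(), "fcst_ds")
obs_dir  <- file.path(tempdir(), "obs_ds")
write_dataset(fcst, fcst_dir, format = "parquet")
write_dataset(obs, obs_dir, format = "parquet")

# open_dataset() does not load the data. The join is evaluated by Arrow and
# only the matched rows are pulled into R by collect().
joined <- open_dataset(fcst_dir) |>
  inner_join(open_dataset(obs_dir), by = c("SID", "valid_dttm")) |>
  collect()

print(joined)
```

Station 1003 has no observation, so it is dropped by the inner join before anything is read into memory.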

Testing Apache Arrow with parquet (continued)

  • Parquet files cannot be appended to, but…
  • Arrow datasets can be extended by adding new files
  • The challenge is finding the right balance for partitioning data
    • don’t want too many small files, but…
    • want datasets to be easily extendable
  • DuckDB was also investigated as an alternative
    • fully self-contained package, so it is extremely slow to install
    • no clear advantages over Apache Arrow
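The partitioning trade-off can be sketched with arrow::write_dataset(). The choice of one partition per month and the column name fcst_month are assumptions made for illustration, not the layout chosen for harp.

```r
library(arrow)
library(dplyr)

ds_dir <- file.path(tempdir(), "fcst_partitioned")

# Hypothetical forecasts for one month; column names are illustrative only.
dec <- data.frame(
  fcst_month = "2024-12",
  SID = c(1001L, 1002L),
  fcst = c(1.5, 2.5)
)
# Hive-style partitioning writes one directory per value of fcst_month.
write_dataset(dec, ds_dir, format = "parquet", partitioning = "fcst_month")

# Extending the dataset: a new month goes into a new partition directory,
# so the existing parquet file is left untouched.
jan <- data.frame(
  fcst_month = "2025-01",
  SID = c(1001L, 1002L),
  fcst = c(0.5, 4.0)
)
write_dataset(jan, ds_dir, format = "parquet", partitioning = "fcst_month")

ds <- open_dataset(ds_dir)
n_files <- length(ds$files)
n_rows <- ds |> collect() |> nrow()
print(c(files = n_files, rows = n_rows))
```

A finer partition key (for example per forecast cycle) would make extension just as easy but would produce many more, much smaller files.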

harpPoint

Memory efficiency

  • Many functions are very memory hungry
  • Deterministic verification refactored
  • No extra rows for multiple groups
  • Thresholds treated completely separately
  • Easier to add new scores
  • Similar changes will be ported to ensemble verification
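Treating thresholds separately can be illustrated in base R. This sketch is not the harpPoint implementation: the data are made up and contingency() is a hypothetical helper. It only shows the general idea of computing scores one threshold at a time instead of expanding the data to one copy per threshold.

```r
# Made-up deterministic forecasts and observations.
fcst <- c(0.2, 1.4, 3.1, 0.0, 5.6, 2.2)
obs  <- c(0.0, 2.0, 2.5, 0.4, 6.1, 0.3)
thresholds <- c(1, 2, 5)

# Hypothetical helper: contingency-table counts for a single threshold.
contingency <- function(fcst, obs, threshold) {
  f <- fcst >= threshold
  o <- obs >= threshold
  data.frame(
    threshold = threshold,
    hit = sum(f & o),
    false_alarm = sum(f & !o),
    miss = sum(!f & o),
    correct_rejection = sum(!f & !o)
  )
}

# Each threshold is processed on its own, so only two logical vectors the
# length of the data exist at any one time.
scores <- do.call(
  rbind,
  lapply(thresholds, function(th) contingency(fcst, obs, th))
)
print(scores)
```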

harpVis

New plot method

  • Works on harp_grid_df data frames [output of read_forecast(), read_analysis(), or read_grid(..., data_frame = TRUE)]
  • Powered by ggplot2, so plots are easy to edit and extend
  • Facets by the valid_dttm column by default, but another column can be chosen
  • Data are downsampled by a factor that depends on the number of pixels in the x-direction; the factor can be changed by the user
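The downsampling idea can be sketched in base R. The rule used here (a stride chosen so that at most max_nx points remain in the x-direction) and the helper downsample_grid() are assumptions for illustration. The actual rule used by the plot method may differ.

```r
# Hypothetical helper (not the harpVis implementation): keep every
# stride-th point in both directions, with the stride derived from the
# number of points in the x-direction (the rows of z).
downsample_grid <- function(z, max_nx = 100) {
  stride <- max(1, ceiling(nrow(z) / max_nx))
  z[seq(1, nrow(z), by = stride), seq(1, ncol(z), by = stride), drop = FALSE]
}

# Made-up field: 500 points in x, 300 points in y.
z <- matrix(seq_len(500 * 300), nrow = 500, ncol = 300)
small <- downsample_grid(z, max_nx = 100)
print(dim(small))
```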

Utility functions for plotting

  • censor_low_squish_high() removes data below the lower limit and “squishes” data above the upper limit
    • Useful for e.g. precipitation plots
  • abs_range() returns the absolute range of data
    • Useful for getting equal limits on colour bars
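The behaviour described for these two utilities can be sketched in base R. The functions below are stand-ins with assumed arguments, not the harpVis functions themselves. In particular, reading "absolute range" as limits symmetric about zero is an assumption.

```r
# Stand-in for the described behaviour: values below the lower limit become
# NA (censored), values above the upper limit are clamped to it (squished).
censor_low_squish_high_sketch <- function(x, limits) {
  low  <- !is.na(x) & x < limits[1]
  high <- !is.na(x) & x > limits[2]
  x[high] <- limits[2]
  x[low]  <- NA
  x
}

# Assumption: the absolute range is symmetric about zero, which gives a
# colour bar with equal limits on either side.
abs_range_sketch <- function(x) {
  m <- max(abs(x), na.rm = TRUE)
  c(-m, m)
}

# Made-up precipitation values and forecast biases.
precip <- c(0, 0.05, 1.2, 30, 75)
censored <- censor_low_squish_high_sketch(precip, limits = c(0.1, 50))
print(censored)

bias <- c(-1.5, 0.2, 3.0)
lims <- abs_range_sketch(bias)
print(lims)
```

For precipitation this leaves dry points unplotted while very large values share the top colour instead of disappearing.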

harp

Work in progress

  • Functions for handling configuration files
  • Running scripts
  • Function to set up a skeleton harp project directory with basic configuration and run scripts